Abstract. A tractable nonparametric prior over densities is introduced which is
closed under sampling and exhibits proper posterior asymptotics. 

Keywords: Bayesian nonparametrics, Bayesian density estimation. 



1 Introduction 

The early 1970s witnessed Bayesian inference going nonparametric with the introduction
of statistical models with infinite dimensional parameter spaces; the most conspicuous
being the Dirichlet Process (Ferguson 1973), which is a prior on the class of all
probability measures over a given sample space that trades great analytical tractability
for a reduced support: as shown by Blackwell (1973), its realizations are, almost
surely, discrete probability measures. The posterior expectation of a Dirichlet Process
is a probability measure that gives positive mass to each observed value of the sample,
making the plain Dirichlet Process unsuitable to handle inferential problems such
as density estimation. Many extensions and alternatives to the Dirichlet Process have
been proposed (Ghosh and Ramamoorthi 2002).



In this paper we construct a prior distribution over the class of densities with respect 
to Lebesgue measure. Given a partition in subintervals of a bounded interval of the real 
line, we define a random density whose realizations have a constant positive value on 
each subinterval of the partition. The distribution of the values of the random density 
on each subinterval is specified by transforming and conditioning a multivariate normal 
distribution. 

Our construction of the random density resembles the stochastic processes introduced
by Thorburn (1986) and Lenk (1988), with the following differences. Since our
definition relies on a finite dimensional random object, instead of a more general
stochastic process, our proofs are simpler, we can represent the random density directly
in our numerical computations, instead of keeping its values on a finite number of
arbitrarily chosen points, and we do not need to interpolate our estimates. To make the
distribution of his random density closed under sampling, Lenk (1988) was forced to
introduce a parameter which does not have a natural interpretation, whereas in our case
the desired closure follows more naturally, as does the proper asymptotic behavior of
our posterior distribution.

An outline of the paper is as follows. In Section 2 we give the formal definition
of a simple random density. In Section 3 we prove that the distribution of a simple
random density is closed under sampling. The results of the simulations in Section 4
show the asymptotic behavior of the posterior distribution. We extend the model
hierarchically in Section 5 to deal with random partitions. Although the usual Bayes
estimate of a simple random density is a discontinuous density, in Section 6 we compute
smooth estimates solving a decision problem where the states of nature are realizations
of the simple random density and the actions are smooth densities of a suitable class.
Additional propositions and proofs of all the results in the paper are given in Section 7.



Let $(\Omega, \mathscr{F}, P)$ be the probability space from which we induce the distributions
of all random objects considered in the paper. For some integer $k > 1$, let $\mathbb{R}^k_+$ be
the set of vectors of $\mathbb{R}^k$ with positive components. Write $\mathscr{B}^k$ for the Borel
sigma-field of $\mathbb{R}^k$. Let $\lambda_k$ denote Lebesgue measure over $(\mathbb{R}^k, \mathscr{B}^k)$.
We omit the indexes when $k = 1$. The components of a vector $v \in \mathbb{R}^k$ are written
as $v_1, \dots, v_k$.

Suppose that we have been given an interval $[a, b] \subset \mathbb{R}$, and a set of real numbers
$\Delta = \{t_0, t_1, \dots, t_k\}$, such that $a = t_0 < t_1 < \dots < t_k = b$, inducing a
partition of $[a, b]$ into the $k > 1$ subintervals

$$[a, t_1),\ [t_1, t_2),\ \dots,\ [t_{k-2}, t_{k-1}),\ [t_{k-1}, b].$$

The class of simple densities with respect to this partition consists of the nonnegative
simple functions that have a constant value on each subinterval and integrate to one.
Let $d_i = t_i - t_{i-1}$, for $i = 1, \dots, k$, and define $S_\Delta : \mathbb{R}^k \to \mathbb{R}$ by
$S_\Delta(u) = \sum_{i=1}^k d_i u_i$. Each simple density $f : \mathbb{R} \to \mathbb{R}$ within this
class can be represented as

$$f(x) = \sum_{i=1}^k h_i \, I_{[t_{i-1}, t_i)}(x),$$

where $h = (h_1, \dots, h_k) \in \mathbb{R}^k$ is such that each $h_i > 0$, and $S_\Delta(h) = 1$.
The $h_i$'s will be called heights of the steps of the simple density $f$.
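
As an illustration of this representation, the following minimal Python sketch evaluates a simple density from its partition points and step heights and checks the normalization $S_\Delta(h) = 1$. The function names `s_delta` and `simple_density`, and the toy partition, are illustrative choices and not part of the paper.

```python
from bisect import bisect_right

def s_delta(t, h):
    """Return S_Delta(h) = sum_i d_i * h_i, with d_i = t_i - t_{i-1}."""
    return sum((t[i + 1] - t[i]) * h[i] for i in range(len(h)))

def simple_density(t, h, x):
    """Evaluate f(x) = sum_i h_i * I_[t_{i-1}, t_i)(x) for partition points t."""
    if x < t[0] or x >= t[-1]:
        return 0.0  # the half-open indicators vanish outside [t_0, t_k)
    # bisect_right gives the index of the first partition point greater than x
    return h[bisect_right(t, x) - 1]

t = [0.0, 0.5, 1.0]   # toy partition points t_0 < t_1 < t_2 of [0, 1]
h = [0.5, 1.5]        # toy step heights, chosen so that S_Delta(h) = 1
```

Here `s_delta(t, h)` evaluates to one, so `h` is a valid vector of heights for this partition.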

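From now on, let $\mathbb{H}_r = \{v \in \mathbb{R}^k_+ : d_1 v_1 + \dots + d_k v_k = r\}$, for
$r \in \mathbb{R}$. Note that, by the definition of the $d_i$'s given above, it follows that
$\mathbb{H}_r = \emptyset$ if $r \le 0$. Also, define the projection on the first $k - 1$
coordinates $\pi : \mathbb{R}^k \to \mathbb{R}^{k-1}$ by $\pi(v_1, \dots, v_{k-1}, v_k) = (v_1, \dots, v_{k-1})$.
For a normal random vector $Z = (Z_1, \dots, Z_k)$ with mean $m \in \mathbb{R}^k$ and $k \times k$
covariance matrix $\Sigma$, denote by $U \sim L_k(m, \Sigma)$ the distribution of the lognormal
random vector $U = (e^{Z_1}, \dots, e^{Z_k})$. If $\Sigma$ is nonsingular, it is easy to show that
$U$ has a density

$$f_U(u) = (2\pi)^{-k/2} \, |\Sigma|^{-1/2} \Big(\prod_{i=1}^k u_i\Big)^{-1}
\exp\Big(-\tfrac{1}{2} (\log u - m)^\top \Sigma^{-1} (\log u - m)\Big) \, I_{\mathbb{R}^k_+}(u),$$

where $|\Sigma|$ is the determinant of $\Sigma$, $\log u = (\log u_1, \dots, \log u_k)^\top$ and
$m = (m_1, \dots, m_k)^\top$.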

We define a random density whose realizations are simple densities with respect to
the partition induced by $\Delta$ by specifying the distribution of the random vector of its
step heights. Informally, the step heights will have the distribution of a lognormal
random vector $U$ given that $S_\Delta(U) = 1$. The formal definition of the random density
is given in terms of a version of the conditional distribution of $U$ given $S_\Delta(U)$ and the
expression of its conditional density with respect to a dominating measure. However, we
are outside the elementary case where the joint distribution is dominated by a product
measure. In fact, we have in Proposition 7.1 a simple proof that Lebesgue measure
$\lambda_{k+1}$ and the joint distribution of $U$ and $S_\Delta(U)$ are mutually singular.

Figure 1: Geometrical interpretation of the measures $\tau_r$ of Lemma 2.1, for $r > 0$, in the
particular case when $k = 3$. The value of $\tau_r(A)$ is the area of the projection
$\pi(A \cap \mathbb{H}_r)$ multiplied by $d_3^{-1}$.

A suitable family of measures that dominate the conditional distribution of $U$ given
$S_\Delta(U)$, for each value of $S_\Delta(U)$, is described in the following lemma.

Lemma 2.1. Let $\tau_r : \mathscr{B}^k \to \mathbb{R}$ be defined by
$\tau_r(A) = d_k^{-1} \lambda_{k-1}(\pi(A \cap \mathbb{H}_r))$, for $r \in \mathbb{R}$.
Then, each $\tau_r$ is a measure over $(\mathbb{R}^k, \mathscr{B}^k)$.

A proof of Lemma 2.1 is given in Section 7. Figure 1 gives a simple geometric
interpretation of the measures $\tau_r$ when the underlying partition is formed by three
subintervals.

The following result is the basis for the formal definition of the random density. 

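Theorem 2.2. Let $U \sim L_k(m, \Sigma)$, with nonsingular $\Sigma$, and let $\{\tau_r\}_{r \in \mathbb{R}}$ be
the family of measures over $(\mathbb{R}^k, \mathscr{B}^k)$ defined on Lemma 2.1. Then,
$\mu_{U \mid S_\Delta(U)} : \mathscr{B}^k \times \mathbb{R}_+ \to \mathbb{R}$ defined by

$$\mu_{U \mid S_\Delta(U)}(A \mid r) = \frac{1}{f_{S_\Delta(U)}(r)} \int_A f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u),$$

where $f_{S_\Delta(U)}(r) = \int_{\mathbb{R}^k} f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u)$, is a regular version of
the conditional distribution of $U$ given $S_\Delta(U)$.
Moreover, $\mu_{U \mid S_\Delta(U)}(\mathbb{H}_r \mid r) = 1$, for each $r > 0$.

The necessary lemmata and a proof of Theorem 2.2 are given in Section 7. The
following definition of the random density uses the specific version of the conditional
distribution constructed in Theorem 2.2.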

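Definition 2.3. Let $U \sim L_k(m, \Sigma)$, with nonsingular $\Sigma$. We say that
$\varphi : \mathbb{R} \times \Omega \to \mathbb{R}$ defined by

$$\varphi(x, \omega) = \sum_{i=1}^k H_i(\omega) \, I_{[t_{i-1}, t_i)}(x)$$

is a simple random density, where $H = (H_1, \dots, H_k)$ are the random heights of the
steps of $\varphi$, with distribution given by $\mu_H(A) = \mu_{U \mid S_\Delta(U)}(A \mid 1)$, for
$A \in \mathscr{B}^k$, where $\mu_{U \mid S_\Delta(U)}$ is the regular version of the conditional distribution
of $U$ given $S_\Delta(U)$ obtained in Theorem 2.2. Hence, for every $A \in \mathscr{B}^k$, we have

$$\mu_H(A) = \frac{1}{f_{S_\Delta(U)}(1)} \int_A f_U(h) \, I_{\mathbb{H}_1}(h) \, d\tau_1(h),$$

where $\tau_1(A) = d_k^{-1} \lambda_{k-1}(\pi(A \cap \mathbb{H}_1))$ and it holds that $\mu_H(\mathbb{H}_1) = 1$.
We use the notation $\varphi \sim \Delta(m, \Sigma)$.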

3 Conditional Model 

Now we model a set of absolutely continuous observables conditionally, given the value
of a simple random density $\varphi$. The following lemma, proved in Section 7, describes the
conditional model and determines the form of the likelihood.

Lemma 3.1. Let $\varphi \sim \Delta(m, \Sigma)$ with representation
$\varphi(x, \omega) = \sum_{i=1}^k H_i(\omega) \, I_{[t_{i-1}, t_i)}(x)$.
Suppose that the random variables $X_1, \dots, X_n$ are conditionally independent and
identically distributed, given that $H = h$, with distribution
$\mu_{X_i \mid H}(A \mid h) = \int_A f(y) \, d\lambda(y)$, where we have defined
$f(y) = \sum_{i=1}^k h_i \, I_{[t_{i-1}, t_i)}(y)$. Define $X = (X_1, \dots, X_n)$ and let
$x = (x_1, \dots, x_n) \in \mathbb{R}^n$. Then, $\mu_{X \mid H}(\,\cdot \mid h) \ll \lambda_n$, almost surely
$[\mu_H]$, with Radon-Nikodym derivative

$$\frac{d\mu_{X \mid H}}{d\lambda_n}(x \mid h) = f_{X \mid H}(x \mid h) = \prod_{i=1}^k h_i^{c_i},$$

where $c_i = \sum_{j=1}^n I_{[t_{i-1}, t_i)}(x_j)$, for $i = 1, \dots, k$.

The factorization criterion yields that $c = (c_1, \dots, c_k)$ is a sufficient statistic for
$\varphi$. That is, in this conditional model, as one should expect, all the sample information
is contained in the counts of how many sample points belong to each subinterval of the
partition induced by $\Delta$.
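
The sufficient statistic and the likelihood of Lemma 3.1 can be sketched in a few lines of Python. The names `bin_counts` and `log_likelihood` and the toy data are assumptions made for illustration; the sketch also assumes every sample point lies in $[t_0, t_k)$.

```python
import math
from bisect import bisect_right

def bin_counts(t, xs):
    """c_i = number of sample points in [t_{i-1}, t_i), for i = 1, ..., k."""
    counts = [0] * (len(t) - 1)
    for x in xs:
        if t[0] <= x < t[-1]:  # assumption of this sketch: other points are skipped
            counts[bisect_right(t, x) - 1] += 1
    return counts

def log_likelihood(h, counts):
    """log f_{X|H}(x | h) = sum_i c_i * log(h_i)."""
    return sum(c * math.log(hi) for c, hi in zip(counts, h) if c > 0)

t = [0.0, 0.25, 0.5, 1.0]            # toy partition of [0, 1]
xs = [0.1, 0.2, 0.3, 0.7, 0.9, 0.6]  # toy sample
c = bin_counts(t, xs)
```

The likelihood depends on the sample only through `c`, which is the content of the factorization argument.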

Using the notation of Lemma 3.1 and defining $c = (c_1, \dots, c_k)^\top$, we can prove that
the prior distribution of $\varphi$ is closed under sampling.

Theorem 3.2. If $\varphi \sim \Delta(m, \Sigma)$, then $\varphi \mid X = x \sim \Delta(m^*, \Sigma)$, where
$m^* = m + \Sigma c$.

This result, proved in Section 7, has practical consequences, as it makes the simulations
of prior and posterior distributions essentially the same, the only difference being
the computation of $m^*$.
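
A minimal sketch of the update in Theorem 3.2, in plain Python: the posterior parameter is obtained by adding $\Sigma c$ to the prior mean, and the covariance matrix is left unchanged. The function name `posterior_mean` and the numerical values of `m`, `sigma` and `c` are hypothetical.

```python
def posterior_mean(m, sigma, c):
    """Return m* = m + Sigma c; Sigma itself is not changed by the data."""
    k = len(m)
    return [m[i] + sum(sigma[i][j] * c[j] for j in range(k)) for i in range(k)]

m = [1.0, 1.0, 1.0]                 # hypothetical prior mean vector
sigma = [[0.05, 0.01, 0.00],
         [0.01, 0.05, 0.01],
         [0.00, 0.01, 0.05]]        # hypothetical covariance matrix
c = [2, 1, 3]                       # counts per subinterval
m_star = posterior_mean(m, sigma, c)
```

Any routine that simulates from the prior with parameters `(m, sigma)` can then be reused for the posterior by passing `(m_star, sigma)` instead.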



4 Stochastic Simulations 

We summarize the distribution of a simple random density $\varphi \sim \Delta(m, \Sigma)$, represented
as $\varphi(x, \omega) = \sum_{i=1}^k H_i(\omega) \, I_{[t_{i-1}, t_i)}(x)$, in two ways. First, motivated
by the fact, proved in Proposition 7.5, that the prior and posterior expectations are
predictive densities, we take as an estimate the expectation of the step heights
$\hat{h} = (E[H_1], \dots, E[H_k])$. Second, the uncertainty of this estimate is assessed defining

$$B(\hat{h}, \epsilon) = \{h \in \mathbb{H}_1 : d(h, \hat{h}) < \epsilon\},$$

for $\epsilon > 0$, and taking as a credible set the $B(\hat{h}, \epsilon)$ with the smallest positive
$\epsilon$ such that $P\{\omega : H(\omega) \in B(\hat{h}, \epsilon)\} = \gamma$, where $\gamma \in (0, 1)$ is the
credibility level.

The Random Walk Metropolis algorithm (Robert and Casella 2004) is used to draw
dependent realizations of the steps of $\varphi$ as values of a Markov chain
$\{H^{(i)}\}_{i \ge 0}$. The two summaries are computed through ergodic means of this chain.
For example, the credible set is determined with the help of the almost sure convergence of

$$\frac{1}{N} \sum_{i=0}^{N} I_{B(\hat{h}, \epsilon)}(H^{(i)}) \longrightarrow
E\big[I_{B(\hat{h}, \epsilon)}(H)\big] = P\{\omega : H(\omega) \in B(\hat{h}, \epsilon)\}.$$
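
The simulation scheme can be sketched as follows. This is not the authors' implementation: it is a minimal random walk Metropolis sampler written under stated assumptions, namely that the chain moves on the first $k - 1$ heights (the last one being fixed by the constraint $S_\Delta(h) = 1$), that the precision matrix `prec` $= \Sigma^{-1}$ is supplied precomputed, and that the distance $d$ in the credible set is the sup distance. All function names and tuning values are hypothetical.

```python
import math
import random

def log_target(h, m, prec):
    """log f_U(h) up to an additive constant, with prec = inverse of Sigma."""
    z = [math.log(v) - mi for v, mi in zip(h, m)]
    quad = sum(z[i] * prec[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))
    return -sum(math.log(v) for v in h) - 0.5 * quad

def complete(y, d):
    """Extend the first k-1 heights to a point with S_Delta(h) = 1, or None if not positive."""
    last = (1.0 - sum(di * yi for di, yi in zip(d, y))) / d[-1]
    h = list(y) + [last]
    return h if all(v > 0 for v in h) else None

def rw_metropolis(d, m, prec, n_iter, step, rng):
    """Random walk Metropolis on the heights; returns the list of visited states."""
    h = [1.0 / sum(d)] * len(d)          # start at the uniform density
    lp = log_target(h, m, prec)
    chain = []
    for _ in range(n_iter):
        y = [v + rng.gauss(0.0, step) for v in h[:-1]]   # symmetric proposal
        prop = complete(y, d)
        if prop is not None:             # proposals outside the support are rejected
            lp_prop = log_target(prop, m, prec)
            if rng.random() < math.exp(min(0.0, lp_prop - lp)):
                h, lp = prop, lp_prop
        chain.append(h)
    return chain

def coverage(chain, center, eps):
    """Ergodic estimate of P{H in B(center, eps)}, using the sup distance (an assumption)."""
    inside = sum(1 for h in chain if max(abs(a - b) for a, b in zip(h, center)) < eps)
    return inside / len(chain)

rng = random.Random(0)
d = [0.5, 0.5]                           # two subintervals of [0, 1]
m = [1.0, 1.0]
prec = [[20.0, 0.0], [0.0, 20.0]]        # inverse of the hypothetical Sigma = 0.05 * I
chain = rw_metropolis(d, m, prec, 2000, 0.2, rng)
h_hat = [sum(h[i] for h in chain) / len(chain) for i in range(len(d))]
```

The credible set would then be found by increasing `eps` until `coverage(chain, h_hat, eps)` reaches the desired level $\gamma$.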

As for the parameters appearing in Definition 2.3, we take in our experiments all the
$m_i$'s equal to one, and the covariance matrix $\Sigma = (\sigma_{ij})$ is chosen in the following
way. Given some positive definite covariance function $C : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, we induce
$\Sigma$ from $C$ defining

$$\sigma_{ij} = C\left(\frac{t_{i-1} + t_i}{2}, \frac{t_{j-1} + t_j}{2}\right),$$

for $i, j = 1, \dots, k$. In our examples we study the family of Gaussian covariance
functions defined by $C_{\rho,\theta}(x, y) = \rho \, e^{-\theta (x - y)^2}$, with dispersion parameter
$\rho > 0$ and scale parameter $\theta > 0$.
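
The construction of the covariance matrix can be sketched in Python as below. The sketch assumes that the covariance function is evaluated at the midpoints of the subintervals, which is how we read the definition of $\sigma_{ij}$, and that the Gaussian covariance has the form $\rho \, e^{-\theta (x - y)^2}$; the names `gaussian_cov` and `induced_sigma` and the parameter values are illustrative.

```python
import math

def gaussian_cov(rho, theta):
    """C_{rho,theta}(x, y) = rho * exp(-theta * (x - y)^2)."""
    return lambda x, y: rho * math.exp(-theta * (x - y) ** 2)

def induced_sigma(t, cov):
    """sigma_ij = C(midpoint of subinterval i, midpoint of subinterval j)."""
    mid = [(t[i] + t[i + 1]) / 2.0 for i in range(len(t) - 1)]
    return [[cov(x, y) for y in mid] for x in mid]

t = [i / 4 for i in range(5)]                 # uniform partition of [0, 1], k = 4
sigma = induced_sigma(t, gaussian_cov(0.05, 8.0))
```

With this choice the diagonal of `sigma` equals $\rho$, and the off-diagonal entries decay with the squared distance between midpoints at a rate set by $\theta$.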

Example 4.1. Let $\varphi \sim \Delta(m, \Sigma)$ and consider the sample space $[0, 1]$ with
$\Delta = \{0, 0.01, 0.02, \dots, 0.98, 0.99, 1\}$. For the sake of generality, we induce $\Sigma$
from the family of Gaussian covariance functions with fixed dispersion parameter $\rho_0$
but with random scale parameter $\theta = Y + 20\,000$, where $Y \sim \mathrm{Gamma}(2, 0.001)$. These
choices guarantee that computations with $\Sigma$ are numerically stable. In Figure 2 the
summaries of the prior distribution of $\varphi$ show that the value of $\rho_0$ controls the
concentration of the prior. Fixing $\rho_0 = 0.05$ and generating data from a mixture

$$\tfrac{1}{3} \cdot \mathrm{Beta}(1, 10) + \tfrac{1}{3} \cdot \mathrm{Beta}(10, 10) + \tfrac{1}{3} \cdot \mathrm{Beta}(30, 5),$$

we have in Figure 3 the posterior summaries for different sample sizes. Note the
concentration of the posterior as we increase the size of the samples.

[Figure 2; one panel is labeled $\rho = 0.01$.]

Figure 2: Effect of the value of $\rho_0$ on the concentration of the prior. The curves in
black are prior expectations and the gray regions are credible sets with credibility level
of 95%.

We observe the same asymptotic behavior of the posterior distribution with data 
coming from a triangular distribution and a mixture of normals, where in the second 
case we truncate the sample space appropriately. 



5 Random Partitions 

Inferentially, we have a richer construction when the definition of the simple random
density involves a random partition. Informally, we want a model for the random density
where the underlying partition adapts itself according to the information contained in
the data.

We consider a family of uniform partitions of a given interval $[a, b]$. Each partition of
this family will be described by a positive integer random variable $K$, which determines
the number of subintervals in the partition. Since the parameter $\rho$ of the family of
Gaussian covariance functions used to induce $\Sigma$ may have different meanings for different
partitions, we treat it as a positive random variable $R$.

Explicitly, we are considering the following hierarchical model: $K$ and $R$ are
independent. Given that $K = k$ and $R = \rho$, we choose the uniform partition of the interval
$[a, b]$ induced by

$$\Delta = \left\{a,\ a + \frac{b - a}{k},\ a + \frac{2(b - a)}{k},\ \dots,\ a + \frac{(k - 1)(b - a)}{k},\ b\right\},$$

induce $\Sigma_{\rho,\theta}$ from the family of Gaussian covariance functions, and make
$\varphi \sim \Delta(m, \Sigma_{\rho,\theta})$. Finally, the observables are modeled as in Lemma 3.1.
This hierarchy is described in the following graph.

Figure 3: Posterior summaries for Example 4.1. On each graph, the black simple density
is the estimate $\hat{\varphi}$, the light gray region is a credible set with credibility level of 95%,
and the dark gray curve is the data generating density.

Figure 4: Posterior summaries for Example 5.1. The black simple density is the estimate
$\hat{\varphi}$, the light gray region is a credible set with credibility 95%, and the dark gray curve
is the data generating density.




In the following example we follow an empirical path: instead of specifying priors
for $K$ and $R$, we define the likelihood of $K$ and $R$ by
$L_x(k, \rho) = f_{X \mid K,R}(x \mid k, \rho)$, whose form is determined in Proposition 7.6, find the
maximum $(\hat{k}, \hat{\rho}) = \arg\max_{k,\rho} L_x(k, \rho)$, and use these values in the definitions
of the prior, determining the posterior summaries as we did in Section 4.

Example 5.1. With a sample of size 2 000 generated from a Beta(4, 2) distribution,
we find the maximum of the likelihood of $K$ and $R$ at $(\hat{k}, \hat{\rho}) = (9, 1.43)$. In Figure 4
we have the posterior summaries obtained using these values in the definition of the
prior. Moreover, in the left graph of Figure 5 we have the distribution function $\hat{F}$
corresponding to the estimated posterior density. For the sake of comparison, we plot
in the right graph of Figure 5 some quantiles of this distribution $\hat{F}$ against the quantiles
of the distribution $F_0$ from which we generated the data.

Figure 5: Example 5.1. On the left graph, the black curve is the estimated distribution
function $\hat{F}$ and the gray curve is the data generating distribution function $F_0$. On the
right graph, we have the comparison of some of the quantiles of $\hat{F}$ and $F_0$.



6 Smooth Estimates 

It is possible to go beyond the discontinuous densities obtained as estimates in the last
two sections and get smooth estimates of a simple random density by solving a Bayesian
decision problem where the states of nature are the realizations of $\varphi$ and the actions are
smooth densities of a suitable class.

In view of Theorem 3.2, it is enough to consider the problem without data. As
before, the sample space is the interval $[a, b]$, which is partitioned according to some
$\Delta$. For some density $f$ with respect to Lebesgue measure, we denote its $L^2$ norm by

$$\|f\|_2 = \left(\int f^2 \, d\lambda\right)^{1/2}.$$

Proposition 6.1. For $N > 1$, let $g_1, \dots, g_N$ be densities with respect to Lebesgue
measure, with support $[a, b]$, such that $\|g_i\|_2 < \infty$, and let $\mathscr{G}$ be the class of
densities of the form $\sum_{i=1}^N \alpha_i g_i$, with $\alpha_i \ge 0$, for $i = 1, \dots, N$, and
$\sum_{i=1}^N \alpha_i = 1$. Let $\varphi \sim \Delta(m, \Sigma)$ and define $\mathscr{S}$ as the class of densities which
are realizations of $\varphi$. Define the loss function $L : \mathscr{S} \times \mathscr{G} \to \mathbb{R}$ by

$$L(s, f) = \|s - f\|_2^2 = \int_a^b \big(s(x) - f(x)\big)^2 \, d\lambda(x).$$

Then, the Bayes decision is $\hat{\varphi} = \sum_{i=1}^N \hat{\alpha}_i g_i$, where the $\hat{\alpha}_i$ minimize
globally the quadratic form

$$Q(\alpha_1, \dots, \alpha_N) = \sum_{i,j=1}^N M_{ij} \, \alpha_i \alpha_j - \sum_{i=1}^N J_i \, \alpha_i,$$

subject to the constraints $\alpha_i \ge 0$, for $i = 1, \dots, N$, and $\sum_{i=1}^N \alpha_i = 1$, with the
definitions

$$M_{ij} = \int_a^b g_i(x) g_j(x) \, d\lambda(x) \quad \text{and} \quad
J_i = 2 \int_a^b g_i(x) \, E[\varphi(x)] \, d\lambda(x).$$

Figure 6: Example 6.2. On the right graph, the black simple density is the estimate $\hat{\varphi}$,
and the light gray region is a credible set with credibility 95%. On both graphs the dark
gray curve is the data generating density. On the left graph, the black smooth density
is the Bayes decision of Proposition 6.1.



We use the result of Proposition 6.1, proved in Section 7, choosing the $g_i$'s inside a
class of smooth densities that serve approximately as a basis to represent any continuous
density with the specified support.
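
The constrained minimization in Proposition 6.1 is a quadratic program over the probability simplex. The paper does not state which solver was used; the sketch below uses projected gradient descent, which is our choice for illustration only. It takes the matrix $M$ and the vector $J$ as already computed inputs, and the names `project_simplex` and `minimize_q` are hypothetical.

```python
def project_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum_i a_i = 1} (sorting method)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        if uj - (css - 1.0) / j > 0:
            theta = (css - 1.0) / j   # kept from the largest index passing the test
    return [max(x - theta, 0.0) for x in v]

def minimize_q(M, J, n_iter=5000):
    """Projected gradient descent for Q(a) = a^T M a - J^T a on the simplex."""
    n = len(J)
    a = [1.0 / n] * n
    # Twice the maximum absolute row sum bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * max(sum(abs(x) for x in row) for row in M))
    for _ in range(n_iter):
        grad = [2.0 * sum(M[i][j] * a[j] for j in range(n)) - J[i] for i in range(n)]
        a = project_simplex([a[i] - step * grad[i] for i in range(n)])
    return a

M = [[1.0, 0.0], [0.0, 1.0]]   # toy positive definite matrix
J = [1.0, 0.6]                 # toy linear term
alpha_hat = minimize_q(M, J)
```

For this toy input the minimizer can be found by hand: substituting $\alpha_2 = 1 - \alpha_1$ gives $Q = 2\alpha_1^2 - 2.4\,\alpha_1 + 0.4$, which is minimized at $\alpha_1 = 0.6$.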

For the next example, suppose that the support of the densities is the interval $[0, 1]$.
Bernstein's Theorem (see Billingsley (1995), Theorem 6.2) states that the polynomial

$$B_N(x) = \sum_{i=0}^N f\!\left(\frac{i}{N}\right) \binom{N}{i} x^i (1 - x)^{N - i}$$

approximates uniformly any continuous function $f$ defined on $[0, 1]$, when $N \to \infty$.
Suppose that $f$ is a density. If we define, for $i = 0, \dots, N$,

$$\alpha_i = f\!\left(\frac{i}{N}\right) \binom{N}{i} \frac{\Gamma(i + 1) \, \Gamma(N - i + 1)}{\Gamma(N + 2)},$$

we can rewrite the approximating polynomial as $B_N(x) = \sum_{i=0}^N \alpha_i \, g_i(x)$, where $g_i$
is a density of a random variable with distribution $\mathrm{Beta}(i + 1, N - i + 1)$. Hence, if we
take a sufficiently large $N$, we expect that any continuous density with support $[0, 1]$
will be reasonably approximated by a mixture of these $g_i$'s.
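
The rewriting of the Bernstein polynomial as a Beta mixture can be checked numerically with the following Python sketch. Since $\binom{N}{i} \Gamma(i+1) \Gamma(N-i+1) / \Gamma(N+2) = 1/(N+1)$, the weights reduce to $\alpha_i = f(i/N)/(N+1)$. The function names and the test density $f(x) = 2x$ are illustrative assumptions.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x in the open interval (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def bernstein_weights(f, n):
    """alpha_i = f(i/n) * C(n, i) * Gamma(i+1) Gamma(n-i+1) / Gamma(n+2) = f(i/n) / (n+1)."""
    return [f(i / n) / (n + 1) for i in range(n + 1)]

def beta_mixture(weights, x):
    """Evaluate sum_i alpha_i * g_i(x), with g_i the Beta(i+1, n-i+1) density."""
    n = len(weights) - 1
    return sum(w * beta_pdf(x, i + 1, n - i + 1) for i, w in enumerate(weights))

f = lambda x: 2.0 * x            # a density on [0, 1], used only as a test case
alpha = bernstein_weights(f, 10)
approx = beta_mixture(alpha, 0.3)
```

Bernstein polynomials reproduce linear functions exactly, so here `approx` agrees with $f(0.3) = 0.6$ up to rounding. For a general density the weights sum to one only approximately, which is one reason to fit them through the constrained problem of Proposition 6.1 rather than using these values directly.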

Example 6.2. Suppose that we have a sample of 5 000 data simulated from a truncated
exponential distribution, whose density is

$$f_0(x) = \frac{2 \, e^{-2(x - 1)}}{e^2 - 1} \, I_{[0,1]}(x).$$

Repeating the analysis made in Example 5.1, we find the maximum of the likelihood of
$K$ and $R$ at $(\hat{k}, \hat{\rho}) = (9, 0.86)$. The left graph of Figure 6 presents the posterior
summaries. After that, we solved the problem of constrained optimization of Proposition 6.1
and found the results shown in the right graph of Figure 6.
7 Additional Results and Proofs

In this section we present some auxiliary propositions and give proofs to all the results 
stated in the paper. 

Proposition 7.1. Let $U \sim L_k(m, \Sigma)$ and denote by $\mu_{U, S_\Delta(U)}$ the joint distribution of
$U$ and $S_\Delta(U)$. Then, $\mu_{U, S_\Delta(U)} \perp \lambda_{k+1}$.

Proof. Define the set $A = \{v \in \mathbb{R}^{k+1} : \sum_{i=1}^k d_i v_i = v_{k+1}\} \in \mathscr{B}^{k+1}$. Then,

$$\mu_{U, S_\Delta(U)}(A) = P\{\omega : (U(\omega), S_\Delta(U(\omega))) \in A\}
= P\Big\{\omega : \sum_{i=1}^k d_i U_i(\omega) = S_\Delta(U(\omega))\Big\} = 1,$$

by definition of $S_\Delta$. On the other hand, note that $\lambda_{k+1}(A) = 0$, since this is the
$(k + 1)$-volume of the $k$-dimensional hyperplane defined by the set $A$. Since
$\mu_{U, S_\Delta(U)}(A^c) = 0$, the result follows. □



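Proof of Lemma 2.1. When $r \le 0$, the result is trivial, since in this case
$\mathbb{H}_r = \emptyset$, making $\tau_r$ a null measure. Suppose that $r > 0$ and let
$g : \mathbb{R}^k \to \mathbb{R}^k$ be the function defined by

$$g(v) = \Big(v_1, \dots, v_{k-1}, \frac{v_k}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} v_i\Big).$$

Define $h_r : \mathbb{R}^{k-1} \to \mathbb{R}^k$ by $h_r(y) = g(y, r)$. We will show that
$\pi(A \cap \mathbb{H}_r) = h_r^{-1}(A)$, for every $A \in \mathscr{B}^k$. Suppose that
$y \in \pi(A \cap \mathbb{H}_r)$. Then, there is a $v \in A \cap \mathbb{H}_r$ such that
$y = \pi(v) = (v_1, \dots, v_{k-1})$ and

$$h_r(y) = g(y, r) = \Big(v_1, \dots, v_{k-1}, \frac{r}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} v_i\Big).$$

Since $v \in \mathbb{H}_r$, we have that $\frac{r}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} v_i = v_k$, implying
that $h_r(y) = v$. Since $v \in A$, it follows from the definition of the inverse image of
$h_r$ that $y \in h_r^{-1}(A)$ and, therefore, we conclude that
$\pi(A \cap \mathbb{H}_r) \subset h_r^{-1}(A)$. To prove the other inclusion, suppose that
$y \in h_r^{-1}(A)$ and define $v = h_r(y)$. Hence, $v \in A$ and by the definition of $h_r$ we
have that

$$v = g(y, r) = \Big(y_1, \dots, y_{k-1}, \frac{r}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} y_i\Big),$$

implying that $v \in \mathbb{H}_r$, because $\sum_{i=1}^k d_i v_i = r$. Since $v \in A \cap \mathbb{H}_r$ and
$y = \pi(v)$, it follows that $y \in \pi(A \cap \mathbb{H}_r)$. Therefore,
$h_r^{-1}(A) \subset \pi(A \cap \mathbb{H}_r)$. Hence, we have that
$\tau_r = d_k^{-1} \lambda_{k-1} \circ h_r^{-1}$ and the usual properties of the inverse image of $h_r$
and the Lebesgue measure entail that each $\tau_r$ is a measure over $(\mathbb{R}^k, \mathscr{B}^k)$. □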
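Lemma 7.2. Let $U \sim L_k(m, \Sigma)$. Let $\xi$, defined by
$\xi(A) = \lambda_k\{u \in \mathbb{R}^k_+ : (u, S_\Delta(u)) \in A\}$, be a measure over
$(\mathbb{R}^{k+1}, \mathscr{B}^{k+1})$. Denote by $\mu_{U, S_\Delta(U)}$ the joint distribution of
$U$ and $S_\Delta(U)$. Then, we have that $\mu_{U, S_\Delta(U)} \ll \xi$, with Radon-Nikodym derivative
$d\mu_{U, S_\Delta(U)} / d\xi = f_{U, S_\Delta(U)}$ given by

$$f_{U, S_\Delta(U)}(u, r) = f_U(u) \, I_{\mathbb{H}_r}(u),$$

where $u \in \mathbb{R}^k$ and $r \in \mathbb{R}$.

Proof. Define the function $T : \mathbb{R}^k_+ \to \mathbb{R}^{k+1}$ by $T(u) = (u, S_\Delta(u))$. Note that
$\xi = \lambda_k \circ T^{-1}$. Define the function $\psi : \mathbb{R}^{k+1} \to \mathbb{R}$ by
$\psi(u, r) = f_U(u) \, I_{\mathbb{H}_r}(u)$, with $u \in \mathbb{R}^k$ and $r \in \mathbb{R}$. The diagram

$$\mathbb{R}^k_+ \xrightarrow{\;T\;} \mathbb{R}^{k+1} \xrightarrow{\;\psi\;} \mathbb{R}, \qquad f_U = \psi \circ T,$$

commutes, since $\psi(T(u)) = \psi(u, S_\Delta(u)) = f_U(u) \, I_{\mathbb{H}_{S_\Delta(u)}}(u) = f_U(u)$, for every
$u \in \mathbb{R}^k_+$. For every $A \in \mathscr{B}^{k+1}$, we have that

$$\begin{aligned}
\mu_{U, S_\Delta(U)}(A) &= P\{\omega : (U(\omega), S_\Delta(U(\omega))) \in A\} = P\{\omega : U(\omega) \in T^{-1}(A)\} \\
&= \int_{T^{-1}(A)} f_U(u) \, d\lambda_k(u) = \int_{T^{-1}(A)} \psi(T(u)) \, d\lambda_k(u) \\
&= \int_A \psi(u, r) \, d\xi(u, r) = \int_A f_U(u) \, I_{\mathbb{H}_r}(u) \, d\xi(u, r),
\end{aligned}$$

where the fifth equality is obtained transforming by $T$, $u \in \mathbb{R}^k$ and $r \in \mathbb{R}$. It follows
that $\mu_{U, S_\Delta(U)} \ll \xi$ and the Radon-Nikodym derivative has the desired expression. □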

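Lemma 7.3. Let $\xi$ be the measure defined on Lemma 7.2 and let $\{\tau_r\}_{r \in \mathbb{R}}$ be the
family of measures defined on Lemma 2.1. Then, for every measurable nonnegative
$\psi : \mathbb{R}^{k+1} \to \mathbb{R}$,

$$\int_{\mathbb{R}^{k+1}} \psi(u, r) \, d\xi(u, r) = \int_{\mathbb{R}} \left(\int_{\mathbb{R}^k} \psi(u, r) \, d\tau_r(u)\right) d\lambda(r),$$

where $u \in \mathbb{R}^k$ and $r \in \mathbb{R}$.

Proof. Define $f : \mathbb{R}^k \to \mathbb{R}^k$ by $f(u) = (u_1, \dots, u_{k-1}, \sum_{i=1}^k d_i u_i)$. Hence, $f$
is a differentiable function whose inverse is the differentiable function $g$ defined in the
proof of Lemma 2.1. The value of the Jacobian on the point $v \in \mathbb{R}^k$ is $J_g(v) = d_k^{-1}$.
Let $A \in \mathscr{B}^k$, $y \in \mathbb{R}^{k-1}$, $r \in \mathbb{R}$, and define $h_r$ as in Lemma 2.1. When
$r > 0$, we have already shown in the course of the proof of Lemma 2.1 that
$\pi(A \cap \mathbb{H}_r) = h_r^{-1}(A)$, for every $A \in \mathscr{B}^k$. Remembering that, by definition,
$\mathbb{H}_r \subset \mathbb{R}^k_+$, it follows that $\pi(A \cap \mathbb{H}_r) = h_r^{-1}(A \cap \mathbb{R}^k_+)$ and we
conclude that $I_{\pi(A \cap \mathbb{H}_r)}(y) = I_{A \cap \mathbb{R}^k_+}(g(y, r))$. Now suppose that $r \le 0$. In
this case, since $\mathbb{H}_r = \emptyset$, we have that $I_{\pi(A \cap \mathbb{H}_r)}(y) = I_\emptyset(y) = 0$. As for
the value of $I_{A \cap \mathbb{R}^k_+}(g(y, r))$, consider two subcases: since

$$g(y, r) = \Big(y_1, \dots, y_{k-1}, \frac{r}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} y_i\Big),$$

if any of the $y_i \le 0$, then $I_{A \cap \mathbb{R}^k_+}(g(y, r)) = 0$; otherwise, we have
$\frac{r}{d_k} - \sum_{i=1}^{k-1} \frac{d_i}{d_k} y_i \le 0$ and again it happens that
$I_{A \cap \mathbb{R}^k_+}(g(y, r)) = 0$. Therefore, we conclude that in this case also
$I_{\pi(A \cap \mathbb{H}_r)}(y) = I_{A \cap \mathbb{R}^k_+}(g(y, r))$. Hence, for $A \in \mathscr{B}^k$ and
$B \in \mathscr{B}$, we have that

$$\begin{aligned}
\xi(A \times B) &= \lambda_k\{u \in \mathbb{R}^k_+ : u \in A, S_\Delta(u) \in B\}
= \int_{\mathbb{R}^k} I_{A \cap \mathbb{R}^k_+}(u) \, I_B(S_\Delta(u)) \, d\lambda_k(u) \\
&= \int_{\mathbb{R}^k} I_{A \cap \mathbb{R}^k_+}(g(y, r)) \, I_B(r) \, |J_g(y, r)| \, d\lambda_k(y, r) \\
&= \int_{\mathbb{R}^k} d_k^{-1} \, I_{\pi(A \cap \mathbb{H}_r)}(y) \, I_B(r) \, d\lambda_k(y, r) \\
&= \int_B \left(d_k^{-1} \int_{\pi(A \cap \mathbb{H}_r)} d\lambda_{k-1}(y)\right) d\lambda(r) = \int_B \tau_r(A) \, d\lambda(r),
\end{aligned}$$

where $y \in \mathbb{R}^{k-1}$ and $r \in \mathbb{R}$, the third equality is obtained transforming by $f$, and the
penultimate is a consequence of Tonelli's Theorem. The result follows from the Product
Measure Theorem and Fubini's Theorem (see Ash (2000), Theorems 2.6.2 and 2.6.4). □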



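Lemma 7.4. Let $U \sim L_k(m, \Sigma)$. Let $\{\tau_r\}_{r \in \mathbb{R}}$ be the family of measures defined on
Lemma 2.1. Let $\mu_{S_\Delta(U)}$ be the distribution of $S_\Delta(U)$. Then, $\mu_{S_\Delta(U)} \ll \lambda$, with
Radon-Nikodym derivative $d\mu_{S_\Delta(U)} / d\lambda = f_{S_\Delta(U)}$ given by

$$f_{S_\Delta(U)}(r) = \int_{\mathbb{R}^k} f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u).$$

Proof. Let $A \in \mathscr{B}$, $u \in \mathbb{R}^k$, and $r \in \mathbb{R}$. Let $\xi$ be the measure defined on Lemma 7.2.
We have that

$$\begin{aligned}
\mu_{S_\Delta(U)}(A) &= P\{\omega : S_\Delta(U(\omega)) \in A\} = P\{\omega : U(\omega) \in \mathbb{R}^k, S_\Delta(U(\omega)) \in A\} \\
&= \mu_{U, S_\Delta(U)}(\mathbb{R}^k \times A) = \int_{\mathbb{R}^k \times A} f_U(u) \, I_{\mathbb{H}_r}(u) \, d\xi(u, r) \\
&= \int_A \left(\int_{\mathbb{R}^k} f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u)\right) d\lambda(r),
\end{aligned}$$

where the penultimate equality follows from Lemma 7.2 and the last equality follows
from Lemma 7.3. Hence, $\mu_{S_\Delta(U)} \ll \lambda$ and the Radon-Nikodym derivative has the desired
expression. □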
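Proof of Theorem 2.2. Let $\mu_{U, S_\Delta(U)}$ be the joint distribution of $U$ and $S_\Delta(U)$, and
let $\mu_{S_\Delta(U)}$ be the distribution of $S_\Delta(U)$. For $A \in \mathscr{B}^k$ and $B \in \mathscr{B}$, by the
definition of conditional distribution, we have that

$$\begin{aligned}
\mu_{U, S_\Delta(U)}(A \times B) &= P\{U \in A, S_\Delta(U) \in B\} = \int_B \mu_{U \mid S_\Delta(U)}(A \mid r) \, d\mu_{S_\Delta(U)}(r) \\
&= \int_B \mu_{U \mid S_\Delta(U)}(A \mid r) \, \frac{d\mu_{S_\Delta(U)}}{d\lambda}(r) \, d\lambda(r),
\end{aligned}$$

where we have used the Leibniz rule for the Radon-Nikodym derivatives. On the other
hand, by Lemmas 7.2 and 7.3, we have that

$$\begin{aligned}
\mu_{U, S_\Delta(U)}(A \times B) &= \int_{A \times B} f_U(u) \, I_{\mathbb{H}_r}(u) \, d\xi(u, r) \\
&= \int_B \left(\int_A f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u)\right) d\lambda(r),
\end{aligned}$$

with $u \in \mathbb{R}^k$ and $r \in \mathbb{R}$. Both expressions for $\mu_{U, S_\Delta(U)}(A \times B)$ are compatible if

$$\mu_{U \mid S_\Delta(U)}(A \mid r) = \frac{\int_A f_U(u) \, I_{\mathbb{H}_r}(u) \, d\tau_r(u)}{f_{S_\Delta(U)}(r)}$$

for almost every $r$ $[\lambda]$. Therefore, we have that $\mu_{U \mid S_\Delta(U)}(\,\cdot \mid r) \ll \tau_r$, for almost
every $r > 0$ $[\lambda]$, with Radon-Nikodym derivative
$d\mu_{U \mid S_\Delta(U)} / d\tau_r = f_{U \mid S_\Delta(U)}(\,\cdot \mid r)$ given by

$$f_{U \mid S_\Delta(U)}(u \mid r) = \frac{f_U(u)}{f_{S_\Delta(U)}(r)} \, I_{\mathbb{H}_r}(u),$$

as desired. The fact that $\mu_{U \mid S_\Delta(U)}(\mathbb{H}_r \mid r) = 1$ follows immediately. □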

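Proof of Lemma 3.1. Let $\alpha_h$ be the measures over $(\mathbb{R}^n, \mathscr{B}^n)$ defined by
$\alpha_h(A) = \int_A \big(\prod_{i=1}^k h_i^{c_i}\big) \, d\lambda_n(x)$, for each $h \in \mathbb{H}_1$. Let
$B = B_1 \times \dots \times B_n$, with $B_i \in \mathscr{B}$, for $i = 1, \dots, n$. By the hypothesis of
conditional independence and Tonelli's Theorem, we have that

$$\begin{aligned}
\mu_{X \mid H}(B \mid h) &= \prod_{j=1}^n \mu_{X_j \mid H}(B_j \mid h) = \prod_{j=1}^n \int_{B_j} f(x_j) \, d\lambda(x_j)
= \int_B \Big(\prod_{j=1}^n f(x_j)\Big) d\lambda_n(x) \\
&= \int_B \prod_{j=1}^n \Big(\sum_{i=1}^k h_i \, I_{[t_{i-1}, t_i)}(x_j)\Big) d\lambda_n(x)
= \int_B \Big(\prod_{i=1}^k h_i^{c_i}\Big) d\lambda_n(x) = \alpha_h(B).
\end{aligned}$$

Hence, $\mu_{X \mid H}(\,\cdot \mid h)$ and $\alpha_h$ agree on the $\pi$-system of product sets that generate
$\mathscr{B}^n$. Therefore, by Theorem A.26 of Schervish (1995), both measures agree on the whole
sigma-field $\mathscr{B}^n$. It follows that $\mu_{X \mid H}(\,\cdot \mid h) \ll \lambda_n$, almost surely $[\mu_H]$, and the
Radon-Nikodym derivative has the desired expression. □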
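Proof of Theorem 3.2. By Bayes Theorem, for each $A \in \mathscr{B}^k$, we have that

$$\begin{aligned}
\mu_{H \mid X}(A \mid x) &= C_0 \int_A f_{X \mid H}(x \mid h) \, d\mu_H(h) = C_0 \int_A \Big(\prod_{i=1}^k h_i^{c_i}\Big) d\mu_H(h) \\
&= C_0 \int_A \Big(\prod_{i=1}^k h_i^{c_i}\Big) \frac{d\mu_H}{d\tau_1}(h) \, d\tau_1(h) \\
&= \frac{C_0}{f_{S_\Delta(U)}(1)} \int_A \Big(\prod_{i=1}^k h_i^{c_i}\Big) f_U(h) \, I_{\mathbb{H}_1}(h) \, d\tau_1(h),
\end{aligned}$$

where we have used the expression of the likelihood obtained on Lemma 3.1, the Leibniz
rule for the Radon-Nikodym derivatives, the expression of $d\mu_H / d\tau_1$ in Definition 2.3, and
the constant $C_0$ is such that $\mu_{H \mid X}(\mathbb{H}_1 \mid x) = 1$. The remainder of the proof depends on
some matrix algebra. Let $I$ be the identity matrix. Since, by definition, $\Sigma$ is symmetric,
we have that $I = I^\top = (\Sigma \Sigma^{-1})^\top = (\Sigma^{-1})^\top \Sigma^\top = (\Sigma^{-1})^\top \Sigma$. Therefore, we have
that $(\Sigma^{-1})^\top = \Sigma^{-1}$. Write $l = \log h$. Since the scalar $l^\top \Sigma^{-1} m$ is equal to its
transpose $(l^\top \Sigma^{-1} m)^\top = m^\top \Sigma^{-1} l$, we have that
$(l - m)^\top \Sigma^{-1} (l - m) = l^\top \Sigma^{-1} l - 2 m^\top \Sigma^{-1} l + m^\top \Sigma^{-1} m$.
Defining $d = \Sigma c$, we have

$$\begin{aligned}
\Big(\prod_{i=1}^k h_i^{c_i}\Big) & \exp\Big(-\tfrac{1}{2} (l - m)^\top \Sigma^{-1} (l - m)\Big) \\
&= \exp\Big(-\tfrac{1}{2} \big(-2 d^\top \Sigma^{-1} l + l^\top \Sigma^{-1} l - 2 m^\top \Sigma^{-1} l + m^\top \Sigma^{-1} m\big)\Big) \\
&= C_1 \exp\Big(-\tfrac{1}{2} \big(-2 d^\top \Sigma^{-1} l + l^\top \Sigma^{-1} l - 2 m^\top \Sigma^{-1} l + m^\top \Sigma^{-1} m \\
&\qquad\qquad\qquad + 2 m^\top \Sigma^{-1} d + d^\top \Sigma^{-1} d\big)\Big),
\end{aligned}$$

with $C_1 = \exp\big(-\tfrac{1}{2}(-2 m^\top \Sigma^{-1} d - d^\top \Sigma^{-1} d)\big)$. Define $m^* = m + d$. Since the
scalar $d^\top \Sigma^{-1} m = (d^\top \Sigma^{-1} m)^\top = m^\top \Sigma^{-1} d$, we have that
$(m^*)^\top \Sigma^{-1} m^* = m^\top \Sigma^{-1} m + 2 m^\top \Sigma^{-1} d + d^\top \Sigma^{-1} d$. Hence, we obtain

$$\begin{aligned}
\Big(\prod_{i=1}^k h_i^{c_i}\Big) & \exp\Big(-\tfrac{1}{2} (l - m)^\top \Sigma^{-1} (l - m)\Big) \\
&= C_1 \exp\Big(-\tfrac{1}{2} \big(l^\top \Sigma^{-1} l - 2 (m^*)^\top \Sigma^{-1} l + (m^*)^\top \Sigma^{-1} m^*\big)\Big) \\
&= C_1 \exp\Big(-\tfrac{1}{2} (l - m^*)^\top \Sigma^{-1} (l - m^*)\Big).
\end{aligned}$$

Using this result in the expression of $\mu_{H \mid X}$ together with the expression of $f_U$, we have

$$\mu_{H \mid X}(A \mid x) = C_2 \int_A f_{U^*}(h) \, I_{\mathbb{H}_1}(h) \, d\tau_1(h),$$

where $C_2 = (C_0 C_1) / f_{S_\Delta(U)}(1)$ and $f_{U^*}$ is a density of the random vector
$U^* \sim L_k(m^*, \Sigma)$. We conclude that, given that $X = x$, the vector $H$ has the distribution
of the heights of the steps of a simple random density $\varphi^* \sim \Delta(m^*, \Sigma)$, as desired. □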



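Proposition 7.5. Suppose that the random variables $X_1, \dots, X_{n+1}$ are modeled
according to Lemma 3.1. Denote by $\mu_{X_i}$ the distribution of $X_i$, for $i = 1, \dots, n + 1$.
For convenience, use the notations $X^{(n)} = (X_1, \dots, X_n)$ and
$x^{(n)} = (x_1, \dots, x_n) \in \mathbb{R}^n$. Then, for every $A \in \mathscr{B}$, we have

(a) $\mu_{X_i}(A) = \int_A E[\varphi(y)] \, d\lambda(y)$, for $i = 1, \dots, n + 1$;

(b) $\mu_{X_{n+1} \mid X^{(n)}}(A \mid x^{(n)}) = \int_A E[\varphi(y) \mid X^{(n)} = x^{(n)}] \, d\lambda(y)$, almost surely
$[\mu_{X^{(n)}}]$.

Proof. By Definition 2.3, we have

$$E[\varphi(y)] = E\Big[\sum_{i=1}^k H_i \, I_{[t_{i-1}, t_i)}(y)\Big] = \int_{\mathbb{R}^k} f(y) \, d\mu_H(h),$$

where $h \in \mathbb{R}^k$ and $f(y) = \sum_{i=1}^k h_i \, I_{[t_{i-1}, t_i)}(y)$, for $y \in \mathbb{R}$. In an analogous
manner, we have

$$E[\varphi(y) \mid X^{(n)} = x^{(n)}] = \int_{\mathbb{R}^k} f(y) \, d\mu_{H \mid X^{(n)}}(h \mid x^{(n)}).$$

For item (a), note that

$$\begin{aligned}
\mu_{X_i}(A) &= P\{X_i \in A, H \in \mathbb{R}^k\} = \int_{\mathbb{R}^k} \mu_{X_i \mid H}(A \mid h) \, d\mu_H(h) \\
&= \int_{\mathbb{R}^k} \left(\int_A f(y) \, d\lambda(y)\right) d\mu_H(h) = \int_A \left(\int_{\mathbb{R}^k} f(y) \, d\mu_H(h)\right) d\lambda(y) \\
&= \int_A E[\varphi(y)] \, d\lambda(y),
\end{aligned}$$

where the fourth equality follows from Tonelli's Theorem. For item (b), for each
$B \in \mathscr{B}^n$, we have

$$P\{X_{n+1} \in A, X^{(n)} \in B\} = \int_B \mu_{X_{n+1} \mid X^{(n)}}(A \mid x^{(n)}) \, d\mu_{X^{(n)}}(x^{(n)}).$$

On the other hand, we have

$$\begin{aligned}
P\{X_{n+1} \in A, X^{(n)} \in B\} &= P\{X_{n+1} \in A, X^{(n)} \in B, H \in \mathbb{R}^k\} \\
&= \int_{B \times \mathbb{R}^k} \mu_{X_{n+1} \mid X^{(n)}, H}(A \mid x^{(n)}, h) \, d\mu_{X^{(n)}, H}(x^{(n)}, h) \\
&= \int_{B \times \mathbb{R}^k} \mu_{X_{n+1} \mid H}(A \mid h) \, d\mu_{X^{(n)}, H}(x^{(n)}, h) \\
&= \int_B \left(\int_{\mathbb{R}^k} \mu_{X_{n+1} \mid H}(A \mid h) \, d\mu_{H \mid X^{(n)}}(h \mid x^{(n)})\right) d\mu_{X^{(n)}}(x^{(n)}) \\
&= \int_B \left(\int_{\mathbb{R}^k} \left(\int_A f(y) \, d\lambda(y)\right) d\mu_{H \mid X^{(n)}}(h \mid x^{(n)})\right) d\mu_{X^{(n)}}(x^{(n)}) \\
&= \int_B \left(\int_A \left(\int_{\mathbb{R}^k} f(y) \, d\mu_{H \mid X^{(n)}}(h \mid x^{(n)})\right) d\lambda(y)\right) d\mu_{X^{(n)}}(x^{(n)}) \\
&= \int_B \left(\int_A E[\varphi(y) \mid X^{(n)} = x^{(n)}] \, d\lambda(y)\right) d\mu_{X^{(n)}}(x^{(n)}),
\end{aligned}$$

where the third equality follows from the hypothesis of conditional independence and
Theorem B.61 of Schervish (1995), the fourth equality is a consequence of Theorem
2.6.4 of Ash (2000), and the sixth equality is due to Tonelli's Theorem. Comparing
both expressions for $P\{X_{n+1} \in A, X^{(n)} \in B\}$, we get the desired result. □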

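Proposition 7.6. Let $\mu_K = P \circ K^{-1}$ over $(\mathbb{N}, 2^{\mathbb{N}})$ be the distribution of $K$ and let
$\mu_R = P \circ R^{-1}$ over $(\mathbb{R}, \mathscr{B})$ be the distribution of $R$. Denote by $\mu_{K,R}$ the joint
distribution of $K$ and $R$, which by the independence of $K$ and $R$ is equal to the product
measure $\mu_K \times \mu_R$, and let $\mu_{K,R,H}$ be the joint distribution of $K$, $R$ and $H$. In the
hierarchical model described on Section 5, we have that
$\mu_{X \mid K,R}(\,\cdot \mid k, \rho) \ll \lambda_n$, almost surely $[\mu_{K,R}]$, with Radon-Nikodym derivative

$$\frac{d\mu_{X \mid K,R}}{d\lambda_n}(x \mid k, \rho) = f_{X \mid K,R}(x \mid k, \rho)
= \int_{\mathbb{R}^k} f_{X \mid H}(x \mid h) \, d\mu_{H \mid K,R}(h \mid k, \rho),$$

for the $f_{X \mid H}$ defined on Lemma 3.1.

Proof. Let $A \in \mathscr{B}^n$ and $B \in 2^{\mathbb{N}} \otimes \mathscr{B}$. By the definition of conditional
distribution, we have

$$P\{X \in A, (K, R) \in B\} = \int_B \mu_{X \mid K,R}(A \mid k, \rho) \, d\mu_{K,R}(k, \rho).$$

On the other hand, by arguments similar to those used in the proof of Proposition 7.5,
we have

$$\begin{aligned}
P\{X \in A, (K, R) \in B\} &= P\{X \in A, (K, R) \in B, H \in \mathbb{R}^k\} \\
&= \int_{B \times \mathbb{R}^k} \mu_{X \mid K,R,H}(A \mid k, \rho, h) \, d\mu_{K,R,H}(k, \rho, h) \\
&= \int_{B \times \mathbb{R}^k} \mu_{X \mid H}(A \mid h) \, d\mu_{K,R,H}(k, \rho, h) \\
&= \int_B \left(\int_{\mathbb{R}^k} \mu_{X \mid H}(A \mid h) \, d\mu_{H \mid K,R}(h \mid k, \rho)\right) d\mu_{K,R}(k, \rho) \\
&= \int_B \left(\int_{\mathbb{R}^k} \left(\int_A f_{X \mid H}(x \mid h) \, d\lambda_n(x)\right) d\mu_{H \mid K,R}(h \mid k, \rho)\right) d\mu_{K,R}(k, \rho) \\
&= \int_B \left(\int_A \left(\int_{\mathbb{R}^k} f_{X \mid H}(x \mid h) \, d\mu_{H \mid K,R}(h \mid k, \rho)\right) d\lambda_n(x)\right) d\mu_{K,R}(k, \rho).
\end{aligned}$$

Comparing both expressions for $P\{X \in A, (K, R) \in B\}$, we have

$$\mu_{X \mid K,R}(A \mid k, \rho) = \int_A \left(\int_{\mathbb{R}^k} f_{X \mid H}(x \mid h) \, d\mu_{H \mid K,R}(h \mid k, \rho)\right) d\lambda_n(x),$$

almost surely $[\mu_{K,R}]$, and the result follows. □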

Proof of Proposition 6.1. By Tonelli's Theorem, the expected loss is

$$E[L(\varphi, f)] = \int_a^b f^2(x) \, d\lambda(x) - 2 \int_a^b f(x) \, E[\varphi(x)] \, d\lambda(x) + C_0,$$

where we have defined the positive constant $C_0 = \int_a^b E[\varphi^2(x)] \, d\lambda(x)$. By hypothesis,
each $f$ has the form $f(x) = \sum_{i=1}^N \alpha_i g_i(x)$, leading us to

$$E[L(\varphi, f)] = \sum_{i,j=1}^N \alpha_i \alpha_j \left(\int_a^b g_i(x) g_j(x) \, d\lambda(x)\right)
- 2 \sum_{i=1}^N \alpha_i \left(\int_a^b g_i(x) \, E[\varphi(x)] \, d\lambda(x)\right) + C_0,$$

where we have used the linearity of the integral. Therefore, minimizing the expected loss
is the same as solving the problem of constrained minimization of the quadratic form
$Q$. For the matrix $M = (M_{ij})$, note that, for every nonnull
$y = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$, we have

$$\begin{aligned}
y^\top M y &= \sum_{i,j=1}^N y_i y_j M_{ij} = \sum_{i,j=1}^N y_i y_j \left(\int_a^b g_i(x) g_j(x) \, d\lambda(x)\right) \\
&= \int_a^b \sum_{i,j=1}^N \big(y_i g_i(x) \, y_j g_j(x)\big) \, d\lambda(x)
= \int_a^b \Big(\sum_{i=1}^N y_i g_i(x)\Big)^2 d\lambda(x) > 0,
\end{aligned}$$

where we have used the linearity of the integral. Therefore, the matrix $M$ is positive
definite, yielding (see Bazaraa and Shetty (2006)) that the quadratic form $Q$ is convex and
the problem of constrained minimization of $Q$ has a single global solution
$(\hat{\alpha}_1, \dots, \hat{\alpha}_N)$. Since the Bayes decision is the $f$ that minimizes the expected loss,
the result follows. □